Cladistics (1996) 12:65-82 



ON COMBINING PROTEIN SEQUENCES AND NUCLEIC ACID 

SEQUENCES IN PHYLOGENETIC ANALYSIS: 

THE HOMEOBOX PROTEIN CASE 

Donat Agosti¹, David Jacobs² and Rob DeSalle¹

¹Department of Entomology, American Museum of Natural History, New York, NY
10024, U.S.A. and ²Department of Biology, University of California, Los Angeles, CA,
U.S.A.

Received for publication 28 June 1995; accepted 20 December 1995 

Abstract — Amino acid encoding genes contain character state information that may be useful 
for phylogenetic analysis on at least two levels. The nucleotide sequence and the translated 
amino acid sequences have both been employed separately as character states for cladistic 
studies of various taxa, including studies of the genealogy of genes in multigene families. In 
essence, amino acid sequences and nucleic acid sequences are two different ways of character 
coding the information in a gene. Silent positions in the nucleotide sequence (first or third 
positions in codons that can accrue change without changing the identity of the amino acid 
that the triplet codes for) may accrue change relatively rapidly and become saturated, losing 
the pattern of historical divergence. On the other hand, non-silent nucleotide alterations and 
their accompanying amino acid changes may evolve too slowly to reveal relationships among 
closely related taxa. In general, the dynamics of sequence change in silent and non-silent 
positions in protein coding genes result in homoplasy and lack of resolution, respectively. We 
suggest that the combination of nucleic acid and the translated amino acid coded character 
states into the same data matrix for phylogenetic analysis addresses some of the problems 
caused by the rapid change of silent nucleotide positions and overall slow rate of change of 
non-silent nucleotide positions and slowly changing amino acid positions. One major 
theoretical problem with this approach is the apparent non-independence of the two sources 
of characters. However, there are at least three possible outcomes when comparing protein 
coding nucleic acid sequences with their translated amino acids in a phylogenetic context on 
a codon by codon basis. First, the two character sets for a codon may be entirely congruent 
with respect to the information they convey about the relationships of a certain set of taxa. 
Second, one character set may display no information concerning a phylogenetic hypothesis 
while the other character set may impart information to a hypothesis. These two possibilities 
are cases of non-independence; however, we argue that congruence in such cases can be
thought of as increasing the weight of the particular phylogenetic hypothesis that is 
supported by those characters. In the third case, the two sources of character information for
a particular codon may be entirely incongruent with respect to phylogenetic hypotheses 
concerning the taxa examined. In this last case the two character sets are independent in that 
information from neither can predict the character states of the other. Examples of these
possibilities are discussed and the general applicability of combining these two sources of
information for protein coding genes is presented using sequences from the homeobox 
region of 46 homeobox genes from Drosophila melanogaster to develop a hypothesis of 
genealogical relationship of these genes in this large multigene family. 

© 1996 The Willi Hennig Society 

Introduction 

When molecular information for protein coding regions is used in phylogenetic 
analysis either nucleic acid or amino acid sequences are used as the source of 
information for character coding. These two types of sequence are simply two diff- 
erent ways of character coding the same information. In many studies, variations 
on these two themes are employed, such as the use of character transformation 

0748-3007/96/010065+18/$18.00/0 © 1996 The Willi Hennig Society 



66 D. AGOSTI ET AL. 

matrices for the amino acid sequences (PROTOPARS; Felsenstein, 1989) or com- 
plete elimination of the information present in third positions of the nucleic acid 
sequences. The use of nucleic acids as the source of information for character 
coding is most common in studies of mitochondrial cytochrome b sequences 
(Brown et al., 1982; Meyer et al., 1990; Meyer and Wilson, 1990; Normarck et al., 
1991; Irwin et al., 1991; Hedges et al., 1991; Cracraft and Helm-Bychowski, 1992;
Cantatore et al., 1994) and other mitochondrial protein coding genes (Thomas 
and Beckenbach, 1989; Brower and DeSalle, 1994; Brown et al., 1994; Miyamoto 
and Fitch, 1995). Some of these aforementioned studies are of closely related taxa 
where very few changes have occurred in the protein sequence, so inferences 
based on amino acid character states would be uninformative. 

Other studies of deeper taxonomic divergences using mitochondrial protein 
coding sequences translate nucleic acid triplets into their corresponding amino 
acids and use the amino acid information as the source for character coding. Some 
mitochondrial cytochrome b studies involving deep divergences of taxa have used 
this strategy (Meyer and Wilson, 1990; Normarck et al., 1991), but perhaps the
most visible use of this approach is in several studies focusing on the reconstruction
of genealogical relationships among homeobox genes and other developmental
genes (Kappen et al., 1989; Schughart et al., 1989; Sommer et al., 1992; Burglin,
1994; Doyle, 1994). We point out that these two approaches are opposite extremes
of weighting in the a priori assessment of character coding. In one case (nucleic acid
parsimony), the researcher chooses to give extreme weight to the primary nucleic
acid information, while in the other (amino acid parsimony) the primary nucleic 
acid information is ignored and instead extreme weight in coding is given to the 
amino acid information. 

Many researchers suggest that there are problems with both of these extreme 
choices of character coding that can be traced to what we think we know about the 
genetic code. In what follows we discuss these problems and suggest that, in
addition to the extreme choices of amino acid parsimony and nucleic acid parsimony
and the types of a priori weighting approaches that are currently available
(Simon et al., 1994), there are possibilities that have been only narrowly explored. The
purpose of this paper is to examine the behavior of a novel method of character 
coding which is based on the combination of the nucleic acid and the amino acid 
sequence of the same gene. In this paper we are particularly interested in focusing 
these approaches on the reconstruction of the genealogy of important develop- 
mental genes, the homeoboxes (Burglin, 1994; Duboule, 1994; Gehring et al., 
1994; Ruddle et al., 1994). This is an important and interesting endeavor due to 
their regulatory role which specifies unique identities to specific body compart- 
ments along the anteroposterior axis (see Gehring et al., 1994 for a review). In 
order to obtain a resolved, well-supported hypothesis of the relationship between 
these genes, the efficient extraction of the maximum amount of information for 
character coding is necessary. This necessity is due to the small amount of poten- 
tial character state information from these short (180 nucleotides and 60 amino 
acids) coding regions and the lack of reliable potential information about regions 
outside these highly conserved 180 base pair sequences. 

The nature of the triplet code and the well characterized degeneracy of third 
position sites in the triplets explain the more frequent third position changes rela- 
tive to amino acid changes, most of which require a first or second position 



PROTEIN PHYLOGENIES 67 

change. The degeneracy of the code results in a system where amino acids can be 
coded for by a single triplet (M and W), two triplets (C, D, E, F, H, K, N, Q and Y), 
three triplets (I), four triplets (A, G, P, T and V) and six triplets (L, R and S) and 
has resulted in these amino acids being termed one-fold, two-fold, three-fold, four- 
fold and six-fold degenerate, respectively. This understanding of the genetic code 
has led many researchers to suggest that certain a priori assumptions can be made 
concerning the phylogenetic analysis of these sequences. For instance, researchers 
have often claimed that, due to the degeneracy of the genetic code, third positions 
in codons of the nucleic acid sequences should be down weighted relative to 
second and first positions. While this approach may be justifiable at some level, the 
actual weights for implementing this strategy are the real problem because the 
researcher needs to make an arbitrary choice of weighting degree, perhaps based 
on the nucleotide composition, perhaps based on rates of change (Brower and 
DeSalle, 1994; Simon et al., 1994). Either way, an a priori assumption must be made
concerning the characters and the taxa involved. On the other hand, protein parsi- 
mony transformation matrices can be used to character weight amino acid 
sequences (PROTOPARS; Felsenstein, 1989). This method takes advantage of the 
degeneracy of the genetic code and the biochemical similarity of certain codons to 
make assumptions concerning the probability of amino acid transformations. Due 
to the more conserved nature of change of amino acids with respect to nucleic 
acids, fewer changes are observed with amino acid character state information, 
often resulting in less resolution of terminals when using this information in 
cladistic analyses. 
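
The degeneracy classes listed above can be checked directly against the standard genetic code. The following is a minimal sketch, not part of the original analysis: the names `CODON_TABLE` and `degeneracy_classes` are illustrative, and the table assumes the standard (nuclear) code with codons ordered over the bases T, C, A, G.

```python
from collections import defaultdict

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
# "*" marks the three stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def degeneracy_classes(table=CODON_TABLE):
    """Group amino acids by the number of codons that encode them."""
    counts = defaultdict(int)
    for amino_acid in table.values():
        if amino_acid != "*":  # stop codons are not amino acids
            counts[amino_acid] += 1
    classes = defaultdict(list)
    for amino_acid, n_codons in counts.items():
        classes[n_codons].append(amino_acid)
    return {n: sorted(members) for n, members in sorted(classes.items())}
```

Calling `degeneracy_classes()` returns a dictionary keyed by the number of codons (1, 2, 3, 4 and 6), whose values are the one-letter amino acid codes in each class.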

In this paper we describe the combination of nucleic acid sequences with their 
translated amino acid sequences into a single data matrix as a possible method for 
character analysis that lies between the two extremes of amino acid parsimony and 
nucleic acid parsimony. We first consider two related aspects of this data combi- 
nation. The first concerns the implications of character weighting nucleic acid and 
amino acid data (i.e. the actual combination of the two character sets and what 
mechanistically occurs when these two character sets are combined) and the 
second concerns the non-independence of these two character sets that we suggest 
be combined. 

Character Weighting and Three Possible Outcomes of Combining Amino Acid and 
Nucleic Acid Sequences from the Same Protein Coding Region 

When these two sets of information for the same coding region are combined, 
three possible outcomes arise. First, each amino acid and the information in its 
associated triplet may be entirely congruent and equally informative with respect 
to each other. Second, one of the sources of information may be uninformative
while the other is informative. Finally, the two sources of information (a nucleic
acid triplet and its amino acid) may be informative but incongruent.

In the first case listed above, the combination of nucleic acid and amino acid 
information results in giving greater weight to those nucleic acid columns that are 
congruent with their corresponding amino acids. This effect can best be under- 
stood by examining the mechanics of character coding and character weighting 
using a simple example (Fig. 1a). This hypothetical example shows a character
matrix for three ingroups and an outgroup for three characters coded for pres- 




Fig. 1. (a) Hypothetical data set with three ingroup taxa (A, B and C) and an outgroup. Character
states for three characters (1, 2 and 3) are shown. The last column shows the character states that are
added to a matrix when Char 1 is given a weight of two relative to Char 2.

              Char 1    Char 2    Char 3    Weighting column
Taxon A         1         1         1              1
Taxon B         1         0         1              1
Taxon C         0         0         0              0
OUTGROUP        0         0         0              0

(b) Hypothetical data set with three ingroup taxa (A, B and C) and an outgroup. Nucleic acid character
states for three characters (codon 1, codon 2 and codon 3) are shown with their corresponding amino
acids in parentheses. The last column shows the weighting column that is effectively added to a matrix
when amino acid coded characters are combined with the nucleic acid characters. Note that nucleic
acid codon 1 is given a weight of 2 relative to nucleic acid characters 2 and 3. For details see text.

              codon 1    codon 2    codon 3    Weighting column
Taxon A       AAA (K)    AAA (K)    TGC (C)         1 1 1
Taxon B       AAA (K)    AAA (K)    AGC (S)         1 1 0
Taxon C       AAC (N)    AAG (K)    TGT (C)         0 1 1
OUTGROUP      AAC (N)    AAG (K)    TCT (S)         0 1 0


ence (1) and absence (0). Suppose that there is some a priori reason to give more 
weight to the first character and suppose further that we determine it should be 
weighted 2:1 relative to the other characters. A character state column for the four 
taxa can be added to the data matrix to signify the 2:1 weighting of this character 
(Fig. 1a). We see the addition of the amino acid information to the nucleic acid
information behaving in the same way (codon 1 in Fig. 1b).

Addition of the amino acid coded characters in the second and third cases listed
above adds no weight to the existing nucleic acid coded characters (Fig. 1b). For
codon 2 in Figure 1b, addition of the amino acid coded characters adds no information
to the data matrix (all encoded amino acids are K), but does not conflict
with the nucleic acid information. In this case no new information is added to the
matrix when the amino acid coded character states are combined with the nucleic
acid character states. For the third codon in Figure 1b, the first two positions are
uninformative. The first position shows an autapomorphy for taxon B and the 
second position is autapomorphic for the ABC group. The third position, however, 
supports the grouping of taxa A and B while the amino acid coded character sup- 
ports the grouping of A and C. In this case the information added by including the 
amino acid coded data is new and conflicts with the nucleic acid information. 
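
The three outcomes can be made concrete by counting parsimony steps for the Figure 1b data on each of the three possible ingroup resolutions. The sketch below is illustrative only: it uses Fitch's unordered parsimony with equal weights, and the names `fitch_steps`, `tree_length` and `LENGTHS` are assumptions introduced here rather than anything used in the original analyses.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(nt_seq):
    """Conceptually translate an in-frame nucleotide string."""
    return "".join(CODON_TABLE[nt_seq[i:i + 3]] for i in range(0, len(nt_seq), 3))


def fitch_steps(tree, column):
    """Return (state set, steps) for one unordered character on a nested-tuple tree."""
    if isinstance(tree, str):  # a terminal taxon
        return {column[tree]}, 0
    left_states, left_steps = fitch_steps(tree[0], column)
    right_states, right_steps = fitch_steps(tree[1], column)
    shared = left_states & right_states
    if shared:
        return shared, left_steps + right_steps
    return left_states | right_states, left_steps + right_steps + 1


def tree_length(tree, matrix):
    """Sum the Fitch steps over every character (column) of the matrix."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch_steps(tree, {taxon: seq[i] for taxon, seq in matrix.items()})[1]
        for i in range(n_chars)
    )


# Codons 1-3 of Figure 1b, concatenated per taxon.
NUCLEOTIDES = {"A": "AAAAAATGC", "B": "AAAAAAAGC", "C": "AACAAGTGT", "OUT": "AACAAGTCT"}
AMINO = {taxon: translate(seq) for taxon, seq in NUCLEOTIDES.items()}
COMBINED = {taxon: NUCLEOTIDES[taxon] + AMINO[taxon] for taxon in NUCLEOTIDES}

TREES = {
    "(A,B)": ((("A", "B"), "C"), "OUT"),
    "(A,C)": ((("A", "C"), "B"), "OUT"),
    "(B,C)": ((("B", "C"), "A"), "OUT"),
}

# For each resolution: (nucleotide length, amino acid length, combined length).
LENGTHS = {
    name: (tree_length(t, NUCLEOTIDES), tree_length(t, AMINO), tree_length(t, COMBINED))
    for name, t in TREES.items()
}
```

Under these assumptions the amino acid characters alone cannot choose between the (A,B) and (A,C) resolutions (three steps each), whereas the combined matrix prefers (A,B): the amino acid of codon 1 reinforces its congruent third position, and the conflicting amino acid of codon 3 is outvoted.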

One of the desirable aspects of this approach is its ability to give weight to diff- 
erent parts of the codons and different weight to different codons. In the majority 
of coding schemes using codon information, the three nucleotide codon positions 
are given weight wholesale as three classes. For example, second positions are
often given some arbitrary but uniform high weight relative to the other two pos- 
itions that are also given uniform but lower weights. When amino acid information 
is combined with nucleic acid information the weights depend on the congruence 
of the nucleotide position in a codon with the corresponding amino acid position. 
The ability to apply weights differentially and not uniformly has been suggested to 
be not only desirable but a necessary application of the philosophy of cladistic par- 
simony (Goloboff, 1991, 1993). 

A novel effect that the combination of these two types of data produces is 
differential weighting to specific parts of a tree from a single character. In all other 



            codon 1   aa1   codon 2   aa2   codon 3   aa3
Taxon A       AAA      K      GAA      E      TGG      W
Taxon B       AAA      K      GAA      E      TGG      W
Taxon C       AAC      N      GAC      D      TGC      C
Taxon D       AAT      N      GAT      D      TGT      C
OUTGROUP      ACG      T      GCG      A      TCA      S

[Tree diagram: OUT, then clade (Taxon A, Taxon B) with na1, na2, na3 and aa1, aa2, aa3 marked on its
branch, and clade (Taxon C, Taxon D) with aa1, aa2, aa3 marked on its branch.]

Fig. 2. Hypothetical example showing decoupling of amino acid information and nucleic acid
information and differential weights in clade (A, B) versus clade (C, D). Hypothetical nucleic acid data
(in the codon 1, codon 2 and codon 3 columns) and the translated amino acid data (in the aa1, aa2
and aa3 columns). Four taxa and an outgroup (OUT) are shown for both amino acid characters and
nucleic acid characters. The tree below the character states represents the most parsimonious solution
for the amino acid data. Taxa C and D would appear unresolved in the parsimony analysis of the
nucleic acid data. Characters supporting each node are indicated on the branches leading to the node.
Changes from the outgroup to the ingroup taxa and autapomorphies for taxa A through D are not
shown.



cases of character weighting the character is weighted uniformly across all parts of 
a tree. Figure 2 shows another hypothetical example, this time with character state 
information for three nucleic acid codons and the corresponding amino acids. 
Note that two clades are hypothesized on the basis of these hypothetical data, that 
clade 1 (taxa A and B) is supported by three nucleotide positions and all three 
amino acid positions, and that clade 2 (taxa C and D) is supported by three amino 
acid positions and none of the informative nucleic acid positions. 
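
The clade-specific weighting described for Figure 2 can be tallied column by column. This is a rough sketch with a deliberately simple support criterion (all members of a clade share one state that no other taxon shows); the function names are hypothetical, and the second codon of taxon C is assumed to read GAC, the codon consistent with the amino acid D given for it in the figure.

```python
def supports(column, clade):
    """True if every member of `clade` shares one state that no other taxon has."""
    inside = {column[taxon] for taxon in clade}
    outside = {state for taxon, state in column.items() if taxon not in clade}
    return len(inside) == 1 and not (inside & outside)


def count_support(matrix, clade):
    """Number of characters (columns) of `matrix` that support `clade`."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        supports({taxon: seq[i] for taxon, seq in matrix.items()}, clade)
        for i in range(n_chars)
    )


# Figure 2 data: three codons per taxon and the amino acids they encode.
# Assumption: taxon C, codon 2 is GAC (it must translate to D).
NT = {
    "A": "AAAGAATGG",
    "B": "AAAGAATGG",
    "C": "AACGACTGC",
    "D": "AATGATTGT",
    "OUT": "ACGGCGTCA",
}
AA = {"A": "KEW", "B": "KEW", "C": "NDC", "D": "NDC", "OUT": "TAS"}

SUPPORT = {
    "(A,B)": (count_support(NT, {"A", "B"}), count_support(AA, {"A", "B"})),
    "(C,D)": (count_support(NT, {"C", "D"}), count_support(AA, {"C", "D"})),
}
```

With these data, `SUPPORT["(A,B)"]` is `(3, 3)` and `SUPPORT["(C,D)"]` is `(0, 3)`: in the combined matrix clade (A,B) rests on six characters while clade (C,D) rests on the three amino acid characters alone.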



Character Independence 

If the two types of character states are independent, then combination of the 
two types of information using equal weights would a priori be warranted (Nixon 
and Carpenter, 1995). However, the amino acid and nucleic acid character states 
cannot be considered entirely independent because, given the nucleic acid charac- 
ter states, one can always reconstruct exactly the amino acid character states. How- 
ever, changes reflected on the phylogenetic tree at a single amino acid position in 
a molecule will not always be reflected by changes in the three characters that 
make up the triplet that codes for the amino acid (see codon 3 in Fig. 1b). Thus,
the amino acid data can behave in partial independence from each constituent 
nucleotide position. In addition, selection for particular protein structure operates 
directly on the amino acids so the amino acids reflect molecular function. Codon 
evolution is affected by a number of additional factors such as codon bias. Thus, 




the nucleotide data are likely to contain independent historical information not 
reflected in the amino acids. We suggest that, due to the different evolutionary
constraints on the information inherent in the amino acid and nucleic acid charac- 
ters, combination of the two character codings allows us to discover congruent 
character state patterns in the two types of character information. This congruence 
is reflected in an up-weighting of those congruent characters. When the amino 
acid and nucleic acid character states conflict, congruence with other codon positions
then decides the final phylogenetic hypothesis. In these cases of conflict,
combination of the two types of character coding implements a more complete
representation of the data. 



Materials and Methods 

Gene Sequences 

The method of combining amino acid sequences with nucleic acid sequences is 
demonstrated in this paper using Drosophila melanogaster homeoboxes. DNA 
sequences were obtained from GENBANK using Burglin (1994), Duboule (1994), 
Gehring et al. (1994) and Ruddle et al. (1994) as guides for the available homeo- 
boxes in the literature. Because Drosophila homeobox sequences are continually 
being reported in the literature, we established June 1994 as an arbitrary cutoff 
date for obtaining these sequences. More extensive and inclusive studies of these 
genes are currently under way and the present study is presented as a first approxi- 
mation of the genealogy of this multi-gene family. Another criterion that we used 
for inclusion of a homeobox sequence in the data set was the availability of its 
nucleotide sequence in the literature or in GENBANK. This criterion excluded two
homeoboxes whose nucleic acid sequences have not been reported (Ell and 
W13). Amino acid sequences were obtained by translating conceptually the DNA 
sequences using the computer program SCLONE (version 1.05; Biocode, 1989). 
For each homeobox with a nucleotide sequence two files were archived and used 
in subsequent analyses. The data matrix used in this study is available by anony- 
mous ftp (science@amnh.org). 



Phylogenetic Analyses 

Alignment of the 60 amino acids and 180 nucleotides of the homeoboxes for 
which nucleotide sequence was available was assumed to be trivial. Analysis of the 
nucleic acid character set and the amino acid character set were performed separ- 
ately and, in addition, the two character sets were combined (see character 
weighting in the section below).

Cladistic analysis of large numbers of taxa (in the present study homeoboxes
take the place of taxa) is problematic: exhaustive tree searches using cladistic
methods can be completed in realistic computer time only for up to
10 or 11 taxa. Exact branch and bound solutions are possible for data sets of
around 20 taxa, depending on the degree of homoplasy in the data set. With the
large number of homeobox genes in the Drosophila homeobox data set, we are forced
to use heuristic searches. Although heuristic searches do not guarantee that


[Figure 3: schematic diagram. Labels in the diagram: Amino Acid (AA) Parsimony; 3:1 AA Combined
Analysis Weighting; 1:1 Combined Analysis; Nucleic Acid (NA) Parsimony; Weights on first, second and
third codon positions; PROTOPARS.]

Fig. 3. Schematic diagram showing the range of possibilities for weighting schemes in protein coding
nucleic acid sequences. AA parsimony is the case where amino acids are weighted extremely relative to
nucleic acids. NA parsimony is the case where nucleic acid characters are weighted extremely relative to
amino acids. A numerical ratio of 3:1 was also examined in this study; the amino acid data set weight is
given first. The bracket indicates where the PROTOPARS (Felsenstein, 1989) method lies in this range
of applied weights.



the shortest tree will be obtained, certain randomized input order of taxa pro- 
cedures are recommended to enhance the search (Farris, 1990; Swofford and 
Olsen, 1990; Maddison and Maddison, 1992). 
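
The need for heuristic searches follows from how quickly the number of candidate trees grows. For n taxa there are (2n - 5)!! distinct unrooted, fully bifurcating trees, a standard combinatorial result; the short sketch below (the function name is illustrative) evaluates it.

```python
def unrooted_tree_count(n_taxa):
    """Number of distinct unrooted, fully bifurcating trees: (2n - 5)!!."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n <= 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count
```

For example, `unrooted_tree_count(10)` is 2,027,025 and `unrooted_tree_count(11)` is 34,459,425, which is still within reach of exhaustive enumeration, whereas `unrooted_tree_count(47)` (46 homeoboxes plus the outgroup) is far too large to enumerate.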

The nucleic acid coded data were analysed with PAUP (version 3.0; Swofford, 
1994), HENNIG86 (version 1.0; Farris, 1988) and NONA (version 1.0; Goloboff, 
1994) to serve as cross-checks for obtaining shortest trees. The amino acid coded 
data were analysed only with PAUP and NONA, due to the fact that the current 
versions of HENNIG86 cannot accommodate characters with over ten character 
states. To search for trees using PAUP, 100 replications of random additions of
taxa with TBR (Tree Bisection and Reconnection) branch swapping were
implemented. No limit was placed on the number of trees saved at each step or for
each replication. NONA runs used 100 replications of random additions of taxa,
but saved only 20 trees at the end of each run. The trees were rooted with MATa1,
a yeast mating type homeobox.

Character Weighting 

We first performed straight amino acid parsimony and nucleic acid parsimony. 
Next we simply combined the amino acid matrix with the nucleic acid matrix so 
that each nucleotide position in a codon has equal weight with its amino acid. We 
also used the combined matrix with amino acids weighted 3:1 relative to nucleic 
acids. This analysis was used so that each nucleotide in a triplet has one third
the weight of its amino acid. Figure 3 shows a diagram of the four schemes that we
employed. 
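
The two combined schemes differ only in the weight vector attached to the merged matrix. The following is a minimal sketch of that bookkeeping, assuming sequences are stored as strings keyed by gene name; `combine_matrices` and `weighted_length` are hypothetical helpers, not routines from PAUP, HENNIG86 or NONA.

```python
def combine_matrices(nt_matrix, aa_matrix, aa_weight=1):
    """Append amino acid characters to the nucleotide characters of each gene.

    Returns (combined_matrix, weights): every nucleotide character gets
    weight 1 and every amino acid character gets weight `aa_weight`.
    """
    combined = {}
    for gene, nt_seq in nt_matrix.items():
        aa_seq = aa_matrix[gene]
        if len(nt_seq) != 3 * len(aa_seq):
            raise ValueError(f"{gene}: expected three nucleotides per amino acid")
        combined[gene] = nt_seq + aa_seq
    n_aa = len(next(iter(aa_matrix.values())))
    weights = [1] * (3 * n_aa) + [aa_weight] * n_aa
    return combined, weights


def weighted_length(steps_per_character, weights):
    """Tree length when each character's step count is multiplied by its weight."""
    return sum(steps * weight for steps, weight in zip(steps_per_character, weights))
```

For a homeobox of 180 nucleotides and 60 amino acids this yields 240 characters. With `aa_weight=1` all characters count equally (the 1:1 scheme); with `aa_weight=3` each amino acid carries the same total weight as the three nucleotides of its codon (the 3:1 scheme).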

Consistency of Phylogenetic Hypotheses from the Different Analyses

Consistency of the various hypotheses generated from the four phylogenetic 
analyses was assessed by constructing a PAUP file of trees obtained from the 
various analyses. Strict consensus trees were constructed using PAUP from the 
trees in these files to assess the retention of monophyletic groups in the various 
analyses. 
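
A strict consensus retains only those groups that occur in every input tree. The sketch below illustrates the idea on rooted trees written as nested tuples; it is an assumption-laden illustration (the representation and function names are invented here) and does not reproduce how PAUP computes consensus trees.

```python
def clades(tree):
    """Return (leaf set, set of all clades) for a tree written as nested tuples."""
    if isinstance(tree, str):  # a terminal
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def strict_consensus_clades(trees):
    """Clades (other than the full set of terminals) present in every tree."""
    clade_sets = []
    for tree in trees:
        all_leaves, found = clades(tree)
        found.discard(all_leaves)  # the root clade is shared trivially
        clade_sets.append(found)
    return set.intersection(*clade_sets)


# Two hypothetical trees that agree only on the grouping of A and B.
TREE_1 = ((("A", "B"), "C"), ("D", "E"))
TREE_2 = ((("A", "B"), "D"), ("C", "E"))
SHARED = strict_consensus_clades([TREE_1, TREE_2])
```

Here `SHARED` contains the single clade {A, B}; every other grouping is contradicted by at least one of the two trees and would collapse in the strict consensus.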






[Figure 4a: cladogram; branching structure not recoverable. Tip labels, top to bottom: abdA, Ubx, ftz,
Antp, Scr, Dfd, AbdB, pb, zen1, zen2, lab, cad, Ar, otd, gsbBSH9, gsbBSH4, prd, repo, Bap, NK2, tin,
msh, bcd, en, inv, eve, H2.0, ro, bsh, Dll, BarH1, BarH2, ems, NK1/S59, ap, zfh2-I, zfh2-III, zfh2-II,
zfh1, Cf1a, pdm1, pdm2, I-POU, cut, sineoculis, EXD-DPBX, MATa1.]

Fig. 4. (a) Strict consensus tree of the four parsimony trees generated using nucleic acid parsimony.
Tree statistics are given in the text. Abbreviations for the various genes follow Gehring et al. (1994).
Some of the genes have only an abbreviated name. abd-A abdominal-A; Abd-B Abdominal-B; Antp
Antennapedia; ap; Ar Aristaless; bap bagpipe; BarH1; BarH2; bcd bicoid; bsh brain specific homeobox;
cad caudal; Cf1a; cut; Dfd Deformed; Dll Distal-less; ems empty spiracles; en engrailed; eve even-
skipped; EXD-DPBX extradenticle; ftz fushi tarazu; gsbBSH4 gooseberry; gsbBSH9 gooseberry;
H2.0; inv invected; I-POU Pou-box homeobox; lab labial; NK1/S59; NK2; msh muscle specific
homeobox; prd paired; otd orthodenticle; pb proboscipedia; pdm1; pdm2; repo; ro rough; Scr Sex
combs reduced; sineoculis; Ubx Ultrabithorax; zen1 zerknullt 1; zen2 zerknullt 2; zfh1 zinc finger
homeobox; zfh2-I; zfh2-II; zfh2-III.



Results and Discussion 



Tree Topologies 

Forty-six homeobox sequences were obtained from GENBANK (see Fig. 4a) and 
a nucleic acid and amino acid file were constructed using these sequences. The 



[Figure 4b: cladogram; branching structure and most tip labels not recoverable. Legible tip labels:
repo, ap, zfh2-I, zfh2-II, zfh2-III, sineoculis, zfh1, cut, Cf1a, I-POU, pdm1, pdm2, EXD-DPBX, MATa1.]

Fig. 4. (b) Strict consensus tree of the three parsimony trees generated using amino acid parsimony.
Tree statistics are given in the text.



yeast homeobox MATa1 was used as an outgroup. The nucleic acid file consisted
of 180 characters (each with four possible character states reflecting the four possible
nucleic acids) and the amino acid file consisted of 60 characters (each with 20
possible character states reflecting the 20 possible amino acids). Nucleic acid parsimony
analysis resulted in the tree shown in Figure 4a. This tree is a strict consensus
of four parsimony trees (steps=2205; CI=0.203; RI=0.397) generated by parsimony
searches. NONA, HENNIG86 and PAUP gave these same four parsimony trees in
independent runs. The amino acid parsimony tree is shown in Figure 4b, where
three parsimony trees (steps=867; CI=0.493; RI=0.520) were used to construct this
consensus tree.
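
For readers unfamiliar with the tree statistics quoted here, the sketch below gives the standard definitions of the ensemble consistency index (CI) and retention index (RI). The argument names are illustrative, and the text does not say whether uninformative characters were excluded when the reported values were computed, so this is a definition-level sketch rather than a reconstruction of those numbers.

```python
def consistency_index(min_steps, observed_steps):
    """CI = m / s: minimum possible steps divided by steps observed on the tree."""
    return min_steps / observed_steps


def retention_index(min_steps, observed_steps, max_steps):
    """RI = (g - s) / (g - m), where g is the maximum possible number of steps."""
    return (max_steps - observed_steps) / (max_steps - min_steps)
```

Both indices equal 1 when the data fit the tree without homoplasy and decrease as extra steps are required; for instance, `consistency_index(10, 20)` and `retention_index(10, 20, 30)` both give 0.5.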

For the analysis of the combined data set where each set is given equal weight we 



[Figure 5: cladogram; branching structure not recoverable. Tip labels, top to bottom: abdA, Ubx, ftz,
Antp, Scr, Dfd, lab, pb, zen1, zen2, cad, AbdB, Ar, otd, gsbBSH9, prd, gsbBSH4, repo, NK1/S59, ems,
eve, ro, H2.0, BarH1, BarH2, bsh, Bap, NK2, tin, msh, bcd, en, inv, Dll, ap, zfh2-I, zfh2-III, zfh2-II,
zfh1, Cf1a, pdm1, pdm2, I-POU, cut, sineoculis, EXD-DPBX, MATa1.]

Fig. 5. Single most parsimonious tree using the combined matrix with the amino acids weighted
equally with nucleic acids. Tree statistics are given in the text. This tree is our preferred phylogenetic
hypothesis.

simply merged the amino acid file with the nucleic acid file to give a data matrix 
with 240 characters. The combined data set gave the single parsimony tree shown 
in Figure 5. Analysis of the combined data set using two different programs 
(NONA and PAUP) generated the same single shortest tree (Figure 5; steps=3094; 
CI=0.283; RI=0.421). The analysis where amino acid characters were weighted 3:1 
relative to each nucleic acid is shown in Figure 6a. This consensus tree was gener- 
ated from nine parsimony trees (steps=4848, CI=0.357; RI=0.449). 



Consensus Analyses of the Phylogenetic Hypotheses from the Weighting Schemes

Seventeen parsimony trees were generated using the four weighting schemes 




described above and shown in Figure 3; three parsimony trees from the amino acid 
parsimony analysis, four parsimony trees from nucleic acid parsimony, one parsi- 
mony tree from the analysis where amino acids and nucleic acids are weighted 1:1 
and nine parsimony trees from the analysis where amino acids are weighted 3:1 
relative to nucleic acids. 

Figures 4-6 demonstrate that the realm of possible parsimony solutions is large 
for just the four weighting schemes we used. In order to detail more precisely this 
range we have constructed consensus trees of the trees obtained in the various 
weighting schemes to demonstrate which groups of homeobox genes are retained 
over the range of these weighting schemes. Since we have no a priori perception of 
the correct genealogy of these homeobox gene regions, the use of congruence 
methods (Hillis, 1995; Miyamoto and Fitch, 1995; Wheeler et al., 1995) to assess
robustness of the resultant hypotheses is not possible. These consensus trees are 
presented to demonstrate which monophyletic groups are stable in the various 
weighting schemes. 

First we show the entire range of effects by constructing a consensus tree from
the 17 trees obtained from the four analyses (summarized in Fig. 6b). The small
number of nodes that remain resolved in this consensus cladogram demonstrates
how disparate the information extracted under the four weighting schemes is.
Although some of the groups of homeoboxes that are retained in the
consensus tree in Figure 6b make good sense with respect to molecular function,
in general there is a lack of consensus among the parsimony trees from the four
weighting schemes. We examined the consensus of the trees generated from
nucleic acid parsimony (four trees) and amino acid parsimony (three trees), given
that these are considered the two ends of the spectrum of weighting (Fig. 3). Not
surprisingly, the consensus tree from this analysis is very similar to the consensus
tree of all 17 parsimony trees, demonstrating that the two sets of
weighted parsimony trees are no more incongruent with the amino acid parsimony
trees or the nucleic acid parsimony trees than those two are with each other. In fact,
addition of the ten parsimony trees from the two weighting analyses to the trees
from the extreme weighting schemes (nucleic acid parsimony and amino acid
parsimony) resulted in the collapse of only two nodes (arrows shown in Fig. 6b).
The results of constructing these consensus trees from the extreme weighting
schemes suggest that there are inherently different types of information present
in the two data sets. The fact that some groups are retained in the two extreme
weighting schemes (nucleic acid and amino acid parsimony) indicates that the
characters that support these groups are those that will be weighted in the analyses
where the two data sets are combined.

We next compared the agreement of the 1:1 combined analysis (using equal 
weight to each character set) with the extreme weighting schemes (nucleic acid 
parsimony and amino acid parsimony). The combined weighting schemes using 
equal weights for the two character sets produced parsimony trees that were some- 
what inconsistent with the amino acid parsimony trees (Fig. 6c). This consensus 
tree (Fig. 6c) is only marginally more resolved than the consensus trees when all 17 
trees from the four weighting schemes are considered (16 resolved nodes in Fig. 6c 
and 13 resolved nodes in Fig. 6b). When the equally weighted combined analysis 
trees are compared to the nucleic acid parsimony trees a higher degree of consen- 
sus is observed (28 resolved nodes in Fig. 6d). The lack of consensus among the 



[Figure 6, panels (a) to (d): consensus trees. The trees and the figure caption are not recoverable.]


amino acid parsimony trees and the combined analyses trees coupled with the 
higher degree of consensus among the combined analysis trees and the nucleic 
acid parsimony trees suggests that the nucleic acid characters are more consistent 
with respect to the combined character matrix. 
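
The node counts quoted above come from consensus trees. As a minimal sketch of how such a comparison can be made, the following Python fragment treats a strict consensus as the set of clades shared by every input tree and counts them. The nested-tuple tree encoding, the function names and the two five-gene topologies are illustrative assumptions; they are not the trees or programs used in this study, and the text does not state which consensus method was applied.

```python
from typing import FrozenSet, Set

Clade = FrozenSet[str]


def clades(tree) -> Set[Clade]:
    """Return every clade of a rooted nested-tuple tree except the full taxon set."""
    found: Set[Clade] = set()

    def leaves(node) -> Clade:
        if isinstance(node, str):            # a terminal (gene name)
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)                   # record this internal node
        return members

    found.discard(leaves(tree))              # the root groups everything, so drop it
    return found


def strict_consensus(trees) -> Set[Clade]:
    """Clades present in every input tree (assumes at least one tree)."""
    return set.intersection(*(clades(t) for t in trees))


# Hypothetical topologies, for illustration only.
tree_a = ((("en", "inv"), ("zen1", "zen2")), "eve")
tree_b = (("en", "inv"), (("zen1", "zen2"), "eve"))

shared = strict_consensus([tree_a, tree_b])
print(len(shared), "clades are retained in the strict consensus")
```

In this toy case each tree has three internal clades but only the en/inv and zen1/zen2 pairs survive in the consensus, which mirrors the way agreement near the tips can persist while deeper nodes collapse.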

The final aspect of these weighting schemes that we examined concerned the 
effect of higher amino acid weights. This analysis indicates that differentially 
weighting the data sets does affect which groups are retained under 
the different weighting schemes. The consensus tree has 25 resolved nodes, which 
is only four fewer than the consensus tree in Fig. 6d. However, the nodes on which 
the two weighting schemes agree lie 
toward the tips of the tree, whereas the nodes that are lost in the consensus 
lie nearer the base of the tree. Again, many of the groups retained in these consensus 
trees are reasonable with respect to biological function and molecular structure. 
For example, abd-A, Abd-B, Antp, Dfd, ftz, lab, pb, Scr, Ubx, zen1 and zen2 
form a monophyletic group. These genes perform similar biological roles in 
determining parasegmental identities along the anterior-posterior axis. Also, 
Cf1a, I-POU, pdm1 and pdm2 form a monophyletic group. These genes all have 
POU-specific regions outside their homeobox and serve similar biological 
functions. 



Range of Applicability 

In the preceding sections we have examined the behavior of sequence information 
in homeobox genes from Drosophila melanogaster in reconstructing the genealogy 
of this multigene family. We have taken amino acid parsimony and nucleic 
acid parsimony as the two extremes, but advocate the combination of 
nucleic acid data with the information in the amino acid character set. We also 
examined the effects of differentially weighting the two character sets. We chose 
equal weights for the two character sets and 3:1 weights for the amino acids over 
the nucleic acids. We suggest that the 1:1 weighting scheme is sounder on first 
principles (Nixon and Carpenter, 1995), more conservative and therefore more 
appropriate. The 1:1 weighting scheme relies less on the amino acid 
sequences than the 3:1 scheme, so if there is a non-independence effect between 
the nucleic acid and amino acid character sets, giving equal weights to the 
two character sets is more desirable. 
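
To make the effect of the two weighting schemes concrete, the sketch below computes weighted tree lengths for a combined matrix under Fitch (unordered) parsimony. It is only an illustration under stated assumptions: the four taxa, the single-codon alignment, the two candidate topologies and the function names are hypothetical, and the code is not the software used for the analyses reported here.

```python
def fitch_steps(tree, states):
    """Minimum number of changes for one unordered character on a rooted binary tree."""
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):            # terminal: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:                           # descendants agree, no change needed
            return shared
        steps += 1                           # descendants disagree, one change
        return left | right

    visit(tree)
    return steps


def weighted_length(tree, matrix, weights):
    """Sum of per-character step counts, each multiplied by its weight."""
    total = 0
    for i, weight in enumerate(weights):
        column = {taxon: row[i] for taxon, row in matrix.items()}
        total += weight * fitch_steps(tree, column)
    return total


# Hypothetical data: three nucleotide characters of one codon followed by the
# amino acid it encodes (CTT/CTC = Leu "L", ATT/ATC = Ile "I").
matrix = {"A": "CTTL", "B": "CTCL", "C": "ATTI", "D": "ATCI"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

schemes = {
    "nucleic acid only": [1, 1, 1, 0],
    "1:1 combined": [1, 1, 1, 1],
    "3:1 combined": [1, 1, 1, 3],   # amino acid character weighted 3
}
lengths = {
    name: (weighted_length(tree_1, matrix, w), weighted_length(tree_2, matrix, w))
    for name, w in schemes.items()
}
for name, pair in lengths.items():
    print(name, pair)
```

With nucleotides alone the two topologies tie, because the homoplastic third position offsets the first position. Adding the amino acid character at equal weight breaks the tie in favor of the first topology, and the 3:1 scheme only widens a gap that the 1:1 scheme already opens.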

In the introduction, we alluded to the range of applicability of this method. 
Obviously, for data sets with closely related taxa, silent third and first positions will 
be most likely to change and the amino acid sequences of all taxa will probably be 
identical. However, we suggest that even in some data sets with closely related taxa 
that show variation in amino acid sequence, addition of the amino acid coded 
characters may result in giving more weight to some of the nucleic acid characters 
that impart structure to a phylogenetic hypothesis. Undoubtedly, though, the most 
useful application of this approach will be in studies of deep taxonomic divergence. 
The divergences of some of the members of the gene family that we examine 
in the present study are most likely to have been very ancient events. Ancient 
divergences have been particularly difficult to resolve and diagnose in many 
studies using nucleic acid information and, consequently, translated amino acid 
information has been used to infer phylogeny in these studies (Boorstein et al., 1994; 
Doyle, 1994; Fang and Brandhorst, 1994; Hashimoto et al., 1994; Hughes, 1994; 
Yokoyama, 1994). In most of these studies the amino acid sequences have diverged 
considerably. Highly divergent amino acid sequences are more likely to have a 
weighting effect on nucleic acid sequences in combined nucleic acid and amino 
acid sequence analyses. 



A Preliminary Classification of Fly Homeoboxes 

Previous attempts to examine the classification of homeobox sequences have 
used amino acid sequences as the primary source of data and, in addition, have 
used phenetic approaches to implement tree construction (Kappen et al., 1989; 
Bürglin, 1994). Our approach of combining nucleic acid sequence data with the 
amino acid data of the encoded gene is very promising for the examination of phylogeny 
in this multigene family. The phylogenetic analysis that we deem most 
appropriate (Fig. 5) recovers certain groups of genes that are widely 
accepted as having a common evolutionary origin or functional similarity, which is also 
suggestive of the validity of this approach. For instance, the ANT-C and BX-C 
homeoboxes appear in a monophyletic group that also includes zen1, zen2, ftz 
and cad (Fig. 5). The discovery of several other sister pairs of homeoboxes that are 
well established (see Duboule, 1994), such as en/inv, ro/eve, bap/tin, 
pdm1/pdm2, zen1/zen2, BarH1/BarH2 and gsb/prd, also suggests that 
the approach is retrieving genealogical information. 
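
The combined coding described above can be sketched as follows: each aligned coding sequence is translated, and the amino acid characters are appended to the nucleotide characters of the same gene. This is a minimal sketch, assuming the standard genetic code, gap-free in-frame sequences and hypothetical gene names and sequences; the function names are ours and do not correspond to any program cited in this paper.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate an in-frame, gap-free coding sequence into one-letter amino acids."""
    if len(nucleotides) % 3:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(
        CODON_TABLE[nucleotides[i:i + 3]] for i in range(0, len(nucleotides), 3)
    )


def combine(alignment):
    """Append the translated amino acid characters to each nucleotide row.

    Columns are analysed independently, so the letter "A" can mean adenine in
    a nucleotide column and alanine in an amino acid column without conflict.
    """
    return {name: seq + translate(seq) for name, seq in alignment.items()}


# Hypothetical two-codon alignment, for illustration only.
alignment = {"gene_1": "CTTAAA", "gene_2": "CTCAAG", "gene_3": "ATTAAA"}
combined = combine(alignment)
for name, row in combined.items():
    print(name, row)
```

Each row of the resulting matrix carries six nucleotide characters followed by two amino acid characters, so a per-character weight vector of the same length can express the 1:1 or 3:1 schemes.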

Comparison of the topology of our tree to Bürglin's (1994) distance tree for all 
available homeoboxes reveals several areas of agreement in the grouping of these 
gene regions. For instance, our HOM-C clade is retained in the Bürglin (1994) 
topology, although there are differences in the order of branching of members 
within this group. The zinc finger-POU clade is also retained. An identical topology 
occurs for the members within the POU clade in both the Bürglin (1994) tree 
and our preferred tree in Figure 5. In addition, several smaller groups are retained 
in both trees, such as the sister relationships of eve and ro, BarH1 and BarH2, 
gsbBSH9 and prd, en and inv, bap and NK2, abdA and Ubx, and zen1 and zen2. 
Despite these areas of agreement, imposing the Bürglin topology on our combined 
data set results in a tree that is 81 steps longer than our parsimony tree in 
Figure 5. The larger number of steps is partially the result of differences in hypothesized 
relationships within our clade that contains the Dll, bsh, ems and NK1 
clades (Fig. 5). 



Conclusion 

All forms of character weighting (a priori, a posteriori and in medias res), as articulated 
in Wheeler (1990) and Albert and Mishler (1992), are cited as being at the heart 
of many systematic controversies (Farris, 1983; Felsenstein, 1989; Albert and 
Mishler, 1992; Miyamoto and Cracraft, 1992; Allard, 1993; Chippendale and 
Wiens, 1994; Huelsenbeck et al., 1994). Goloboff (1993; p. 83; his italics) nicely 
articulates the logical necessity for character weighting from Farris' (1983) expla- 
nation: ". . . it is not that parsimony does not preclude weighting, but rather that it 
requires weighting." Consequently, given the logical necessity or requirement of 
character weighting, methods for deriving character weights objectively are 
highly desirable. 

Arguments have been made concerning protein coding genes and the genetic 
code with respect to character weighting. These arguments suggest that our 
knowledge of nucleic acid and amino acid change and of the dynamics of the 
genetic code allows for the incorporation of models to implement a priori 
weighting of nucleotide positions in protein coding sequences. The method we 
present in this communication does not make the assumptions inherent in those 
models but rather incorporates the actual molecular biological information that 
these models attempt to incorporate. In this sense the method we have described 
is more conservative and more objective than methods where a model is used to 
incorporate this information. In essence, the effects of combining amino acid 
coding and nucleic acid coding of the same source of characters can be summarized 
as three outcomes. In the first case there is complete congruence of the nucleic acid and 
amino acid character states, and this information is given the greatest weight. 
In the second case, where one character coding has no information and the other 
does, the uninformative character does not conflict with its companion, but the pair gets 
less weight than in the first case. In the final case, where the inferences 
based on amino acid and nucleic acid information are incongruent, there is 
conflict and these characters get even less weight than in the first two cases. 
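
The three outcomes can be illustrated with a simplified classifier that compares one nucleotide character with the amino acid character of the same codon. This is a sketch under explicit assumptions: a coding counts as informative when at least two states each occur in at least two taxa, and two codings count as congruent only when they group the taxa identically, which is stricter than compatibility on a tree. The taxa, the codons and the function names are hypothetical.

```python
from collections import Counter


def is_informative(column):
    """Parsimony-informative: at least two states, each shared by two or more taxa."""
    counts = Counter(column.values())
    return sum(1 for n in counts.values() if n >= 2) >= 2


def grouping(column):
    """Partition of the taxa implied by the character states."""
    groups = {}
    for taxon, state in column.items():
        groups.setdefault(state, set()).add(taxon)
    return {frozenset(members) for members in groups.values()}


def classify(nucleotide_column, amino_acid_column):
    """Assign a nucleotide/amino acid character pair to one of the outcomes."""
    nuc = is_informative(nucleotide_column)
    aa = is_informative(amino_acid_column)
    if nuc and aa:
        same = grouping(nucleotide_column) == grouping(amino_acid_column)
        return "congruent" if same else "incongruent"
    if nuc or aa:
        return "one coding uninformative"
    return "both uninformative"


# Hypothetical codons: CTT and CTC encode Leu (L), ATT and ATC encode Ile (I).
codons = {"A": "CTT", "B": "CTC", "C": "ATT", "D": "ATC"}
amino_acids = {"A": "L", "B": "L", "C": "I", "D": "I"}

outcomes = [
    classify({taxon: codon[pos] for taxon, codon in codons.items()}, amino_acids)
    for pos in range(3)
]
print(outcomes)
```

Here the first codon position agrees with the amino acid grouping, the invariant second position contributes nothing, and the third position groups the taxa differently, so the three positions of a single codon fall into the three cases in turn.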



Acknowledgements 

We thank the following people for their patience and criticism in reading earlier 
drafts of this manuscript: James Carpenter, Ward Wheeler, Andrew Brower, Paul 
Goldstein, Elizabeth Bonwitch, Ranhy Bang, Thomas Bürglin and Markus Affolter. 
We also thank P. Goloboff for the special NONA version. 



REFERENCES 

Albert, V. A. and B. D. Mishler. 1992. On the rationale and utility of weighting nucleotide 
sequence data. Cladistics 9: 73-83. 

Allard, M. W. 1993. Nuances of Nucleotides. Cladistics 9: 115-127. 

Boorstein, W. R., T. Ziegelhoffer and E. A. Craig. 1994. Molecular evolution of the HSP70 
multigene family. J. Mol. Evol. 38: 1-17. 

Brower, A. V. Z. and R. DeSalle. 1994. Practical and theoretical consideration for choice of a 
DNA sequence region in insect molecular systematics with a short review of published 
studies using nuclear gene regions. Ann. Entomol. Soc. Am. 87: 702-716. 

Brown, W. M., E. M. Prager, A. Wang and A. C. Wilson. 1982. Mitochondrial DNA sequences 
of primates: tempo and mode of evolution. J. Mol. Evol. 18: 225-239. 

Brown, J. M., O. Pellmyr, J. N. Thompson and R. G. Harrison. 1994. Phylogeny of Greya 
(Lepidoptera: Prodoxidae) based on nucleotide sequence variation in mitochondrial 
cytochrome oxidase I and II: congruence with morphological data. Mol. Biol. Evol. 
11: 128-141. 

Bürglin, T. R. 1994. A comprehensive classification of homeobox genes. In: D. Duboule 
(ed.), Guidebook to the Homeobox genes. Oxford University Press, New York, 
pp. 25-71. 

Cantatore, P., M. Roberti, G. Pesole, A. Ludovico, F. Milella, M. N. Gadaleta and C. Saccone. 
1994. Evolutionary analysis of cytochrome b sequences in some Perciformes: Evidence 
for a slower rate of evolution than in mammals. J. Mol. Evol. 39: 589-597. 

Chippendale, P. T. and J. J. Wiens. 1994. Weighting, partitioning, and combining characters 
in phylogenetic analysis. Syst. Biol. 43: 278-287. 

Cracraft, J. and K. Helm-Bychowski. 1992. Parsimony and phylogenetic inference using DNA 
sequences: some methodological strategies. In: M. M. Miyamoto and J. Cracraft 
(eds), Phylogenetic analysis of DNA sequences. Oxford University Press, New York, 
pp. 184-220. 

Doyle, J. J. 1994. Evolution of a plant homeotic multigene family: Toward connecting mol- 
ecular systematics and molecular developmental genetics. Syst. Biol. 43: 307-328. 

Duboule, D. (ed.) 1994. Guidebook to the Homeobox genes. Oxford University Press, New 
York. 

Fang, H. and B. P. Brandhorst. 1994. Evolution of actin gene families of sea urchins. J. Mol. 
Evol. 39: 347-356. 

Farris, J. 1983. The logical basis of phylogenetic analysis. In: N. Platnick and V. Funk (eds), 
Advances in Cladistics. Proceedings of the second meeting of the Willi Hennig 
Society. Vol. 2. Columbia Univ. Press, New York: 7-36. 

Farris, J. 1988. Hennig86. Ver. 1.5. Port Jefferson, New York. 

Farris, J. 1990. Phenetics in camouflage. Cladistics 6: 91-100. 

Felsenstein, J. 1989. PHYLIP — Phylogenetic inference package. Ver. 3.2. Cladistics 5: 
164-166. 

Gehring, W. J., M. Affolter and T. Bürglin. 1994. Homeodomain proteins. Annu. Rev. 
Biochem. 63: 437-526. 

Goloboff, P. A. 1991. Random data, homoplasy and information. Cladistics 7: 395-406. 

Goloboff, P. A. 1993. Estimating character weights during tree search. Cladistics 9: 83-91. 

Goloboff, P. A. 1994. NONA. Ver. 2.0. The American Museum of Natural History, New 
York. 

Hashimoto, T., Y. Nakamura, F. Nakamura, T. Shirakura, J. Adachi, N. Goto, K. Okamoto and 
M. Hasegawa. 1994. Protein phylogeny gives a robust estimation for early divergences 
of Eukaryotes: Phylogenetic place of a mitochondria-lacking protozoan, Giardia 
lamblia. Mol. Biol. Evol. 11: 65-71. 

Hedges, S. B., R. L. Bezy and L. R. Maxson. 1991. Phylogenetic relationships and biogeography 
of xantusiid lizards, inferred from mitochondrial DNA sequences. Mol. Biol. 
Evol. 8: 767-780. 

Hillis, D. M. 1995. Approaches for assessing phylogenetic accuracy. Syst. Biol. 44: 3-16. 

Hughes, A. L. 1994. Evolution of the ATP-binding-cassette transmembrane transporters of 
vertebrates. Mol. Biol. Evol. 11: 899-910. 

Huelsenbeck, J. P., D. L. Swofford, C. W. Cunningham, J. J. Bull and P. J. Waddell. 1994. Is 
character weighting a panacea for the problem of data heterogeneity in phylogenetic 
analysis? Syst. Biol. 43: 288-291. 

Irwin, D. M., T. D. Kocher and A. C. Wilson. 1991. Evolution of the cytochrome b gene in 
mammals. J. Mol. Evol. 32: 128-144. 

Kappen, C., K. Schughart and F. H. Ruddle. 1989. Evolutionary origin of Homeodomain 
classes. Proc. Nat. Acad. Sci. USA 86: 5459-5463. 

Maddison, W. P. and D. R. Maddison. 1992. MacClade: Ver. 3.0. Analysis of phylogeny and 
character evolution. Sinauer Associates, Sunderland, Massachusetts. 

Meyer, A., T. D. Kocher, P. Basasibwaki and A. C. Wilson. 1990. Monophyletic origin of 
Victoria cichlid fish suggested by mitochondrial DNA sequences. Nature (Lond.) 
347: 550-553. 

Meyer, A. and A. C. Wilson. 1990. Origins of tetrapods inferred from their mitochondrial 
DNA affiliation to lungfish. J. Mol. Evol. 31: 359-364. 

Miyamoto, M. M. and J. Cracraft (eds) 1992. Phylogenetic analysis of DNA sequences. 
Oxford University Press, New York. 358 pp. 

Miyamoto, M. M. and W. M. Fitch. 1995. Testing species phylogenies and phylogenetic 
methods with congruence. Syst. Biol. 44: 64-76. 

Nixon, K. C. and J. D. Carpenter. (Submitted). On simultaneous analysis. Cladistics. 

Normark, B. B., A. R. McCune and R. G. Harrison. 1991. Phylogenetic relationships of 
neopterygian fishes, inferred from mitochondrial DNA sequences. Mol. Biol. Evol. 8: 
819-834. 

Ruddle, F. H., J. L. Bartels, K. L. Bentley, C. Kappen, M. T. Murtha and J. W. Pendleton. (In 
press). Evolution of Hox genes. Annu. Rev. Genet. 28. 

Schughart, K., C. Kappen and F. H. Ruddle. 1989. Duplication of large genomic regions 
during the evolution of vertebrate homeobox genes. Proc. Nat. Acad. Sci. USA 86: 
7067-7071. 

ScLONE, 1989. SuperCLONE. Ver. 1.0. Biocode. 

Simon, C., F. Frati, B. Beckenbach, B. Crespi, H. Liu and P. Flook. 1994. Evolution, weighting, 
and phylogenetic utility of mitochondrial gene sequences and a compilation of conserved 
polymerase chain reaction primers. Ann. Entomol. Soc. Am. 87: 651-701. 

Sommer, R., M. Retzlaff, K. Goerlich, K. Sander and D. Tautz. 1992. Evolutionary conservation 
pattern of zinc-finger domains of Drosophila segmentation genes. Proc. Nat. 
Acad. Sci. USA 89: 10782-10786. 

Swofford, D. L. 1994. PAUP-Phylogenetic analysis using parsimony. Ver. 3.0. Illinois Natural 
History Survey, Champaign, Illinois. 

Swofford, D. L. and G. J. Olsen. 1990. Phylogeny reconstruction. In: D. M. Hillis and C. Moritz 
(eds), Molecular Systematics. Sinauer, Sunderland, Massachusetts, pp. 411-501. 

Thomas, W. K. and A. T. Beckenbach. 1989. Variation in salmonid mitochondrial DNA: evol- 
utionary constraints and mechanisms of substitution. J. Mol. Evol. 29: 233-245. 

Wheeler, W. C. 1990. Combinatorial weights in phylogenetic analysis: a statistical parsimony 
procedure. Cladistics 6: 269-275. 

Wheeler, W. C., J. Gatesy and R. DeSalle. 1995. Elision: A method for accommodating multiple 
molecular sequence alignments with alignment-ambiguous sites. Mol. Phylogenet. 
Evol. 4: 1-9. 

Yokoyama, S. 1994. Gene duplications and evolution of the short wavelength-sensitive visual 
pigments in vertebrates. Mol. Biol. Evol. 11: 32-39. 



